Identifying myoglobin as a mediator of diabetic kidney disease: a machine learning-based cross-sectional study

In view of the alarming increase in the burden of diabetes mellitus (DM) today, a rising number of patients with diabetic kidney disease (DKD) is forecasted. Current DKD predictive models often lack reliable biomarkers and perform poorly. In this regard, serum myoglobin (Mb) identified by machine learning (ML) may become a potential DKD indicator. We aimed to elucidate the significance of serum Mb in the pathogenesis of DKD. Electronic health record data from a total of 728 hospitalized patients with DM (286 DKD vs. 442 non-DKD) were used. We developed DKD ML models incorporating serum Mb and metabolic syndrome (MetS) components (insulin resistance and β-cell function, glucose, lipid) while using SHapley Additive exPlanation (SHAP) to interpret features. Restricted cubic spline (RCS) models were applied to evaluate the relationship between serum Mb and DKD. Serum Mb-mediated renal function impairment induced by MetS components was verified by causal mediation effect analysis. The area under the receiver operating characteristic curve of the DKD machine learning models incorporating serum Mb and MetS components reached 0.85. Feature importance analysis and SHAP showed that serum Mb and MetS components were important features. Further RCS models of DKD showed that the odds ratio was greater than 1 when serum Mb was > 80. Serum Mb showed a significant indirect effect in renal function impairment when using MetS components such as HOMA-IR, HGI and HDL-C/TC as a reason. Moderately elevated serum Mb is associated with the risk of DKD. Serum Mb may mediate MetS component-caused renal function impairment.

Data preparation. From the EHRs of T2DM, we extracted a total of 157 features in the following 7 domains: demographic information, hospitalization-related information, general information, laboratory tests, diagnosis, medication, and others. The data domain categories and descriptions are listed in Table 2. The details of feature calculation are shown in Table S1.
Diagnoses were organized using the ICD -9 and ICD-10 hierarchies. These diseases were diagnosed by experienced doctors who were licensed to practice medicine. T2DM was diagnosed according to the World Health Organization (WHO) criteria (1999) 24 . DKD was diagnosed by the NKF/KDOQI (2007) classification 25 . Serum Mb was tested by Mb-LATEX CN from Denka Seiken (Reference range 0-70 ng/mL).
ML algorithms. Two popular ML methods, extreme gradient boosting (XGBoost) and random forest (RF), were used to build DKD classification models based on EHR data. RF, first proposed by Breiman in 2001 35 , is an ensemble of unpruned classification or regression trees created using bootstrap samples of training data and random feature selection in tree induction. XGBoost is an efficient and scalable implementation of the gradient boosting framework 36 . It is used to develop models in a sequential stagewise fashion like other boosting methods and generalize them by allowing optimization of arbitrary differentiable loss functions. RF and XGBoost were applied using the randomForest (4. [6][7][8][9][10][11][12][13][14] 37 and xgboost (1.3.2.1) 38 R packages, respectively. All ML methods used grid search and fivefold cross-validation for parameter adjustment. The final hyperparameters for the DKD models are listed in Table S2. (SP), positive predictive value (PPV), negative predictive value (NPV), recall, F1, and balanced accuracy (BA), were used to analyze the discrimination ability of DKD ML models. The probability scores output by the models were used to calculate the area under the receiver operating characteristic (ROC) curve (AUC). These statistical parameters are defined as follows:

Feature evaluation and model interpretation.
Feature importance of RF was estimated by (1) mean decrease in prediction accuracy without the feature in the model and (2) mean decrease in the Gini index, a measure of impurity of the dataset (i.e. risk of misclassifying data), by including the feature. For both parameters, the higher the score, the more important the feature. We evaluated feature contributions to model predictions using an interpretability method called SHapley Additive exPlanation (SHAP). SHAP is based on the Shapley value, a concept introduced in the 1950s to measure each player's contribution to a collaborative game 39 . In 2017, Lundberg and Lee proposed a broadly applicable SHAP method to interpret a variety of models, including regression and classification 40,41 . For binary classification in XGBoost, the output of the model is the log odds ratio, that is, the SHAP values sum up to the log loss of the model for each sample. The function is defined as follows  120 min C-peptide, TC, TG, LDL-C, HDL-C, HDL-C/TC ratio, HGI, TyG index, TG/HDL ratio,  HOMA-IR, HOMA-BETA, IGI, IGI/HOMA-IR, I Diagnoses T2DM, hyperlipidemia, DKD, CKD, stages of CKD, renal anemia, renal bone disease, DR, stages of DR, DN, diabetic foot, CAD, heart failure, stages of heart failure, extremity arteriosclerosis, carotid atherosclerosis, renal artery stenosis and sclerosis, cerebral arteriosclerosis, retinal arteriosclerosis, acute MI, old MI, congestive heart failure, stroke, end-stage renal disease, sepsis, septicemia, HT, stages of HT, peptic ulcer, fatty liver disease, cirrhosis, viral hepatitis, chronic bronchitis, COPD, asthma, arrhythmia, atrial flutter and atrial fibrillation, pregnancy period, hyperthyroidism, hypothyroidism, hyperparathyroidism, hashimoto's thyroiditis, brain tumors, breast tumors, lung tumors, digestive system tumors, urinary system tumors, thyroid tumors, tumors of the male reproductive system, tumors of the female reproductive system, leukemia, lymphoma, metastasis, respiratory system infection www.nature.com/scientificreports/ where z is the log odds ratio, x ′ is a simplified binary feature vector containing values 1 or 0 for observed and missing features, M is the number of simplified input features, and ϕ i is the attribution value of the ith molecular descriptor or the SHAP value. In probability theory, z is the corresponding value of probability p after a certain transformation, and the transformation function is defined as follows where p varies monotonically in the range (0, 1) and z varies from infinitesimal to infinity. A feature is predicted to be a risk factor if its z-value is greater than 0. The larger the z-value, the more likely it is a risk factor rather than a protective factor. Therefore, the SHAP method can help us understand the model more intuitively. The SHAPforxgboost (0.1.1) and fastshap (0.0.5) R packages were applied in the analysis.
Decision curve analysis (DCA). We used DCA to assess the clinical utility of ML models. A decision curve is generated by plotting net benefit against a range of threshold probabilities. Decision curves include benefits and harms on the same scale, so they can be directly compared, supporting comparisons between models. The net benefit is calculated as where N is the total sample size and P t represents the threshold probability. Two extreme strategies, classifying for all and classifying for none, were added for reference. The R package rmda (1.6) was used.

RCS models. The association between serum Mb and DKD was evaluated on a continuous scale with RCS
curves based on logistic regression models. To balance best fit and overfitting, the number of knots was chosen empirically to be 4. The serum Mb associated with the lowest risk of DKD was the concentration with the lowest odds ratio on the spline curve. Odds ratios were adjusted for gender, age, BMI, hyperlipidemia, hypertension, age-adjusted Charlson comorbidity index (ACCI), and hospitalization date.
Causal mediation analysis. Causal mediation analysis identifies potential pathways to explain the observed association between a given exposure and a particular outcome 42  Differences between groups were based on the Wilcoxon test for continuous variables and the chi-square test for categorical variables. Relationships between MetS components, serum Mb, and DKD were assessed with the Spearman's correlation coefficient (r s ), controlling for 7 variables: gender, age, BMI, hyperlipidemia, hypertension, ACCI, and admission date. All statistical significance was defined as P < 0.1. Missingness was addressed by setting missing categorical data to 0, whereas for numerical data, imputation was completed using the predictive mean matching (PMM) strategy.
Bias. Observer bias may arise from the data on patients hospitalized in a grade A tertiary general hospital, as these patients may have more severe than average DM or DKD. Besides, selection bias caused by exclusion of UMALB missing individuals brings DKD rate elevation but comparable cardiovascular risks, leading to limited representation (Table S3). However, the significance and possible mediator role of serum Mb in DKD, as a common rule, was mainly focused in our study, and we believe that the bias did not greatly affect our conclusion.

Results
Participant characteristics. The clinical characteristics of the participants are presented in Table 1. There were no significant between-group differences in age, sex, and BMI. However, the DKD group showed significantly more diabetes complications such as diabetic retinopathy and diabetic neuropathy. Importantly, serum Mb was significantly elevated in the DKD group, consistent with previous studies 45 .

DKD model performance.
To verify and evaluate the significance of serum Mb in DKD, we built DKD ML models incorporating serum Mb, taking advantage of its freedom from collinearity in processing highdimensional EHR data on disease phenotypes and clinical outcomes [46][47][48] , aiming to discover important features overlooked by traditional methods. We first accessed the model performance to ensure that the following feature importance analysis and model interpretation are accurate and reliable. After removing features with a missing ratio of more than 40% or zero variance, a total of 88 important features (including renal function) were available for DKD modeling to obtain high-accuracy DKD classification models. Two popular ML algorithms, XGBoost and RF algorithms, have been applied, named XGBoost-88 and RF-88, respectively. The model performances in the testing set are summarized in Table 3 and the ROC curves are shown in Fig. 2A. Overall, both approaches performed satisfactorily (e.g. all AUCs > 0.85). Clearly, the best classifier based on the RF algorithm was able to achieve AUC values of 0.884, BA values of 0.798, and PPV values of 0.854 in the testing set.
We also applied XGBoost and RF algorithms to renal function exclusion dataset (contained 83 features and removed UMALB, serum creatinine (Scr), eGFR, blood urea nitrogen (BUN), and serum uric acid (SUA)), aiming to find the features hidden by renal function features. XGBoost-83 and RF-83 were named respectively. Model performances in the testing set are summarized in Table 3 and ROC curves are shown in Fig. 2B. On the whole, the performances of both approaches were relatively satisfactory (e.g. all AUCs > 0.80). Compared with the model including renal function features, BA decreased by about 0.1, SP did not change much, and SE decreased by 0. DCA for ML models. To assess the clinical utility of several ML models we built, DCA was applied to these ML models (Fig. 2C,D). For both XGBoost/RF-88 and XGBoost/RF-83, the net benefits of two models were similar across the range of threshold probabilities. When the threshold probability was between 0.4 and 0.6, both models outperformed the reference strategies. These results of DCA show the consistency and efficiency of our models and are consistent with our model performance data (Table 3).
Feature importance analysis and model interpretation. Next, to identify important features hidden from indicators of renal function and attest the importance of serum Mb for DKD, feature importance analysis and model interpretation were applied, the reliability of which has been proved by model performance and DCA.
The top 30 important variables for RF-88 and RF-83 are presented in Fig. 3A,B, and two estimators, mean decrease accuracy and mean decrease Gini, have been applied to measure feature importance. In RF-88, UMALB, eGFR, and Scr were listed as top-ranked features. It's worth noting that serum Mb consistently ranked in the front. Meanwhile, MetS components such as I/G 120 min, I/G 0 min, HOMA-BETA, HOMA-IR, TyG index, Gutt index, 120 min C-peptide (reflecting IR and β-cell function), HGI, fasting glucose (reflecting glucose metabolism), and LDL-C, TG/HDL ratio (reflecting lipid metabolism) were on the list. Intriguingly, in RF-83, serum Mb emerged as the most important feature under all estimators. Meanwhile, MetS components such as 120 min C-peptide, fasting C-peptide, IGI, IGI/HOMA-IR, HOMA-BETA, HOMA-IR, I/G 0 min, TyG index (reflecting IR and β-cell function), fasting glucose (reflecting glucose metabolism), LDL-C, HDL-C (reflecting lipid metabolism) were relatively important features.
To overcome the black-box nature of XGBoost and RF algorithms and better understand the prediction models, model interpretation based on XGBoost-88 correct classifications was carried out using the SHAP approach. The top 30 important features identified by SHAP are shown in Fig. 3C. We found that the renal function features UMALB, Scr, and eGFR remained dominant. Serum Mb (ranked 13th) was recognized as an important feature. MetS components, including HOMA-IR, fasting C-peptide, 120 min C-peptide, IGI/HOMA-IR, HOMA-BETA, I/G 0 min, Gutt index (reflecting IR and β-cell function), 120 min glucose, fasting glucose, HGI (reflecting glucose metabolism), and HDL-C (reflecting lipid metabolism) were important features. These results are consistent with the feature importance analysis described above. We noted that higher serum Mb, worsening renal functions (i.e. higher UMALB and Scr and lower eGFR), more diabetes comorbidities (i.e. diabetic neuropathy and diabetic www.nature.com/scientificreports/ retinopathy), worsening IR and β-cell function (higher IGI/HOMA-IR, AC, HOMA-BETA, and I/G 0 min), and worse glucose control (e.g. higher HGI and AGLU) were associated with DKD. Our results suggest that in addition to renal function features, serum Mb is a hidden important feature in DKD, and that elevated serum Mb is associated with DKD classification. Meanwhile, MetS components also play a key role in helping to build ML decision-making algorithms in DKD.

RCS identifies the association between serum Mb and DKD. By using ML algorithm-based variable
importance analysis and SHAP, we verified the importance of serum Mb for DKD discrimination, supporting serum Mb as a potential important feature of DKD. Next, we further evaluated the association between serum Mb and DKD using RCS on a continuous scale (Fig. 4A,B). We found that the odds ratio was greater than 1 when serum Mb was greater than about 80, indicating that moderately elevated serum Mb is associated with DKD risk. Similar results were seen in male and female patients. However, when adjusting for eGFR, serum Mb was not significant associated with DKD across the range (Fig S2), indicating a strong correlation between serum Mb and eGFR.

Causal mediation analysis reflects serum Mb as a possible mediator of MetS component-induced renal function impairment.
Based on the aforementioned ML variable importance analysis, SHAP approach, and RCS results, we noticed that serum Mb and MetS components were closely related to DKD. However, studies have shown that serum Mb is elevated in the case of metabolism disorders, and Mb may directly or indirectly cause tissue damage. Thus, we systematically evaluated whether serum Mb mediates renal impairment associated with the MetS components (20 indicators), and the most significant results are shown in Fig. 5. The spearman correlation coefficient between MetS components and among MetS components, serum Mb, and renal function indicators are shown in Fig. S3 and Table S4, indicating that each component was irreplaceable. Renal function impairment (or DKD severity) was evaluated by eGFR and UMALB. www.nature.com/scientificreports/ Significant IE and TE were found in 8 out of 20 causal mediation models when evaluating renal impairment using eGFR across all observations. Six indicators (HGI, 120 min glucose, fasting glucose, HOMA-IR, IGI/ HOMA-IR and IGI in descending order of proportion mediated, the same below) reflecting IR, β-cell function, and glucose metabolism in the DKD subgroup and two indicators (HDL-C, HDL-C/TC) reflecting lipid metabolism in the non-DKD subgroup showed significant IE and TE (see Table 4 for significant results and  Table S5 for full results).
Significant IE and TE were found in 3 out of 20 causal mediation models when evaluating renal impairment using UMALB across all observations. 5 indicators (Gutt index, I/G 0 min, 120 min glucose, GHBA1C, and HGI) www.nature.com/scientificreports/ reflecting IR, β-cell function, and glucose metabolism in the DKD subgroup and no indicator in the non-DKD subgroup showed significant IE and TE (see Table 5 for significant results and Table S6 for full results). Based on the above systematic evaluation, we support that serum Mb is a potential mediator of MetS component-induced renal function impairment. Generally, compared with UMALB, an indicator of renal function impairment, eGFR has more significant IE and TE as well as higher mediated proportion, indicating serum Mb as a mediator mainly participates in the decline of eGFR, that is, the later phase of DKD progression. The results also suggest that lipid metabolism-induced renal function impairment through serum Mb is important in the non-DKD subgroup, whereas IR, β-cell function, and glucose metabolism are important in the DKD subgroup. These may indicate different causes of elevated serum Mb, and therefore different glycemic and/or lipid control strategies should be employed.

Discussion
Serum Mb is a potential risk factor for DKD. Current DKD prediction models based on large cohorts of T2DM patients, such as ADVANCE study (2012) 13 , Diabetes Cohort Study (DCS) (2013) 49 , and National Diabetes Register (NDR) (2011) 44 , are generally limited by lack of biomarkers and poor performance. A recently  www.nature.com/scientificreports/ observed relationship between elevated serum Mb and DKD suggests that serum Mb may be a risk factor for DKD 43,45 . However, to our knowledge, no studies have evaluated serum Mb as a risk factor for DKD prediction, nor has serum Mb been used as a risk factor for DKD model construction, partially due to limitations in methods to address collinearity. We first verified the importance of serum Mb in DKD. To find those features that might be overlooked, the ML algorithms were chosen because they are not constrained by collinearity when processing high-dimensional EHR data. Apart from common DKD-related features, such as renal function features and fasting glucose, we also specially developed ML models including serum Mb. Our DKD models had comparable performance to other models, achieving an AUC of 0.85 in the presence of renal function features and 0.80 after removal of renal function features. Further, in our study, feature importance analysis and SHAP have recognized serum Mb as an important feature. Equally important, RCS found that moderately elevated serum Mb was associated with the risk of DKD, which is consistent with previous findings [17][18][19][20] . Therefore, we believe that serum Mb is a potential risk factor for DKD and deserves more attention and further study.

Serum Mb may mediate MetS component-induced renal function impairment.
Current DKD management focuses solely on blood pressure control, glycemic and lipid control, weight loss, and diet [50][51][52] , in other words, we emphasize the strong association between MetS components and DKD, but more efforts are needed to find appropriate treatment for DKD 53,54 . Recently, elevated Mb have been observed in DM and DKD. In a 200 controlled study of 188 DM patients, serum Mb levels in DM patients were elevated and were much higher in DM patients with metabolic syndrome 16 . In a cross-sectional study of DKD, serum Mb was found to be positively correlated with DKD in a model adjusted for age, gender, cardiovascular risk factors, glycemic control, and blood sugar 45 . In another study, serum Mb was found to be related to CKD, and elevated serum Mb levels were related to advanced stage of CKD 43 . Interestingly, the elevation of serum Mb cannot be explained by Statins  www.nature.com/scientificreports/ uses 55 . The cause of elevated serum Mb may be due to abnormal glucose and lipid metabolism and diabetic muscle tissue damage 17,18 . In diabetic rats, liraglutide and insulin increased carnitine palmitoyl transferase 1 (CPT1) expression, decreased Mb, and improved cardiovascular function 19 . In a skeletal muscle-specific (Cpt1bm-/-) mouse model, a 25% fat diet resulted in increased intramuscular lipids and serum Mb and decreased mouse vitality 20 . Further, at the molecular level, Mb interacts with glucose to form AGEs 33,56 , which are active contributors to cellular stress and chronic inflammation 34,57 . Also, oxidative modification of Mb induced by stress-related protein carbonyl compounds is an important cause for the development of microvascular and macrovascular complications in diabetes [21][22][23] .
These cumulative evidence indicate that serum Mb has a mediator role in DKD. Confirming the role of serum Mb in DKD progression may provide a thorough comprehension of DKD pathogenesis and a promising therapeutic target for DKD. However, as far as we known, the causal mediation relationship between serum Mb and MetS components and renal function impairment has not been reported.
In our study, feature importance analysis, SHAP, and RCS revealed that MetS components and elevated serum Mb were closely related to DKD. In causal mediation analysis, significant IE and TE were found using either UMALB or eGFR as indicators of renal function impairment, indicating serum Mb as a mediator in MetS component-induced renal function impairment. Thus, we believe that more studies on the role of serum Mb in DKD, based on multilevel research designs and emerging strategies, will provide insights into the pathogenesis of DKD and reveal new treatment targets for DKD.
Study Limitations. Our findings should be further verified in cohort studies, as the strength of the evidence we generated is limited by the cross-sectional study design. External validity was difficult to perform in our study because serum Mb was not a commonly used indicator in DKD studies. Further, it is necessary to gain insights into the pathophysiological mechanisms at the cellular and molecular levels to reveal the causal relationship between MetS components, serum Mb, and renal function impairment.

Data availability
The datasets generated during and/or analyzed during the current study are not publicly available usage due to the other ongoing projects but are available from the corresponding author on reasonable request.